IEEE Transactions on Computational Biology and Bioinformatics — Latest Matching Preprints

1

HGGT: Heterogeneous Gated Graph Transformer for Predicting Clinical Trial Success

Qian, L.; Lu, X.; Haris, P.; Yang, Y.

2026-07-01 health informatics 10.64898/2026.06.28.26356795 medRxiv

Top 0.1%

3.4%

Clinical trials are critical milestones in the drug development pipeline, yet their high failure rates and substantial costs underscore the need for robust predictive models. This study introduces a Heterogeneous Gated Graph Transformer (HGGT) model tailored to predict clinical trial success. Unlike existing methods that typically model trial-related entities in isolation or with homogeneous graphs, HGGT explicitly models the rich heterogeneous relationships among trials, diseases, drugs, genes, targets, abstracts, and eligibility criteria through a gated graph transformer architecture, which dynamically learns and weights multi-type relational interactions to capture complex biological and clinical dependencies. By integrating heterogeneous graph representation with transformer-based context modeling, HGGT effectively captures non-linear, multi-scale interactions across biomedical entities, leading to improved predictive performance for trial success. Experimental results demonstrate that the HGGT model achieves strong performance, with the highest PR-AUC, F1 score, and ROC-AUC across three phases. These findings highlight the potential of graph-based deep learning approaches in optimizing clinical trial design and resource allocation, ultimately accelerating the translation of novel therapies into clinical practice.

2

BertST: BERT-based Spatial Domain Identification in Patient Data

Nnadi, G. O.

2026-07-09 bioinformatics 10.64898/2026.07.04.736527 medRxiv

Top 0.2%

2.4%

Spatial transcriptomics enables the study of gene expression within its native tissue context, providing critical insights into cellular organization and microenvironment-driven biological processes. A key challenge in this field is spatial domain identification, which aims to partition tissue into coherent regions by jointly leveraging gene expression and spatial information. Existing approaches are predominantly based on Graph Neural Networks (GNNs), while Transformer-based approaches, particularly the Bidirectional Encoder Representations from Transformers (BERT) model, remain largely unexplored for modelling both local and long-range dependencies. In this work, we propose BERT for Spatial Transcriptomics (BertST), a transformer-based framework that reformulates spatial transcriptomics as a graph-to-text representation learning problem. Building upon the BERTwalk paradigm, we construct a task-specific multi-graph representation integrating spatial adjacency, pruned gene-expression similarity, and a fully connected gene-expression graph. This design enables the modelling of both local spatial structure and global molecular relationships. Random walks over these graphs are treated as sequences, allowing a BERT model to learn contextualised node embeddings. To further enhance representation quality, we introduce a hierarchical multi-graph propagation strategy, where embedding refinement is performed sequentially: first on the fully connected graph to capture global structure, followed by the pruned graph to refine molecular relationships, and finally on the spatial graph to enforce local smoothness. This ordering ensures that global information is effectively distributed and progressively constrained by biologically meaningful neighbourhoods. We also improve computational efficiency by leveraging PecanPy, a fast and scalable implementation of node2vec, enabling efficient random walk generation on dense graphs.
Experimental results on multiple 10x Visium datasets, including DLPFC and Human Breast Cancer, demonstrate that BertST consistently outperforms or matches GNN-based methods such as ConST, CCST, and SpaceFlow in terms of Adjusted Rand Index (ARI) and Adjusted Mutual Information (AMI). Overall, BertST highlights the potential of transformer-based architectures for spatial omics analysis by effectively capturing both local and long-range spatial-molecular dependencies, offering a promising alternative to traditional graph-based methods.
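
The idea of treating random walks over a graph as token sequences can be illustrated with a small sketch. This is not the BertST implementation: the abstract says walks are generated with PecanPy (node2vec-style biased walks), whereas the hypothetical `random_walks` helper below uses plain uniform walks on a toy adjacency list, and the names `spot_graph` and `sentences` are made up for illustration.

```python
import random


def random_walks(adjacency, walk_length, walks_per_node, seed=0):
    """Generate uniform random walks and return them as lists of string tokens.

    Assumption: uniform neighbour sampling, not the biased node2vec
    sampling that PecanPy implements.
    """
    rng = random.Random(seed)
    walks = []
    for start in sorted(adjacency):
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbours = adjacency[walk[-1]]
                if not neighbours:  # dead end: stop this walk early
                    break
                walk.append(rng.choice(neighbours))
            # Each node id becomes a "word"; the whole walk is one "sentence".
            walks.append([str(node) for node in walk])
    return walks


# Toy undirected graph over four spots (hypothetical data).
spot_graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
sentences = random_walks(spot_graph, walk_length=5, walks_per_node=2)
```

Each inner list in `sentences` could then be passed to a masked-language-model tokenizer in place of a natural-language sentence, which is the sense in which the graph problem becomes a text problem.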

3

Infoxmed2.0-27B: Instruction Tuning, Preference Alignment, and GRPO-Based Reward Model Training for Medical LLMs

Xie, J.; Guo, Z.; Zhao, H.; Ni, H.

2026-06-30 medical education 10.64898/2026.06.25.26356522 medRxiv

Top 0.2%

1.8%

Large language models (LLMs) [1], [2] have demonstrated remarkable capabilities across general domains, yet their application in specialized medical contexts demands rigorous domain adaptation [3], [4]. We present Infoxmed2.0-27B, a medical foundation model built upon Qwen3.5-27B [5] through a comprehensive multi-stage post-training pipeline: (1) proprietary medical data synthesis from a MySQL database with MedicalCategoryTree organization, medical PhD team validation, Chinese RoBERTa [6] semantic deduplication, and API-assisted language refinement; (2) instruction supervised fine-tuning of Qwen3.5-27B via LoRA [7] (r = 8, α = 32) using MS-Swift [8], producing iterations Infoxmed2.0.0 → 2.0.2 → 2.0.4; (3) Direct Preference Optimization (DPO) [9] on 6,283 curated medical preference pairs [10] using DPO-RPO loss (β = 0.3, RPO = 0.1) across eight progressive training iterations (v0-v7); and (4) parallel Group Relative Policy Optimization (GRPO) [11]-based medical reward model training on Qwen3.5 combining internal rule-based reward functions with external DeepSeek signals. Comprehensive evaluations under a uniform LLM-as-Judge [12] framework with GPT-5.4 demonstrate 77.0% accuracy (mean quality score +7.18) on MedMCQA [10] and +2.59 on HLE, with pipeline progression from +6.69 (base) to +7.06 (SFT) to +7.18 (final).

4

GR-SAFS: A Graph-Regularized Stacking Framework with Adaptive Feature Selection for High-Dimensional Prognostic Biomarker Discovery

He, J.; Guan, J.

2026-06-28 bioinformatics 10.64898/2026.06.23.733986 medRxiv

Top 0.3%

1.5%

Identifying prognostic biomarkers from high-dimensional transcriptomic data poses a triple challenge: achieving sparsity, preserving biological network topology, and integrating complementary nonlinear signals. Existing methods typically ignore network structure, miss nonlinear interactions, or lack a principled mechanism to fuse heterogeneous model outputs. We introduce GR-SAFS (Graph-Regularized Stacking with Adaptive Feature Selection), a framework with three modules: a Graph-Lasso engine embedding gene co-expression network Laplacian priors, run in parallel with a Random Forest engine; an empirical cumulative distribution function (eCDF) alignment layer that places sparse and dense importances on a common percentile scale; and a diversity-penalized quadratic programming router whose strict convexity yields a unique global optimum. On the TCGA-LUAD cohort, GR-SAFS identifies a 20-gene signature with a training concordance index of 0.700. Across two independent cross-platform microarray cohorts, GR-SAFS is the only method whose frozen signature retains statistically significant risk stratification in every cohort, whereas baselines with a stronger C-index lose significance on at least one external cohort. Functional enrichment anchors the signature to a coherent Wnt/β-catenin axis. An open-source implementation is released for full reproducibility.
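
The eCDF alignment step can be sketched as a rank transform that puts two importance vectors on the same percentile scale. This is only an illustration under assumptions: the abstract does not say how ties or exact zeros are handled, and the helper name `ecdf_scores` and the toy importance values are hypothetical.

```python
from bisect import bisect_right


def ecdf_scores(importances):
    """Map each importance to its empirical CDF value in (0, 1].

    Assumption: tied values share the same score (the fraction of
    entries less than or equal to them).
    """
    ordered = sorted(importances)
    n = len(ordered)
    return [bisect_right(ordered, value) / n for value in importances]


# Hypothetical per-gene importances from two engines on four genes.
lasso_importance = [0.0, 0.0, 0.8, 0.1]       # sparse: exact zeros
forest_importance = [0.02, 0.30, 0.25, 0.10]  # dense: all non-zero

lasso_pct = ecdf_scores(lasso_importance)
forest_pct = ecdf_scores(forest_importance)
```

After the transform both vectors live on the same (0, 1] scale, so a downstream combiner can weight them without one engine dominating purely because of its raw magnitude.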

5

Predicting subclonal TP53 mutations from tumor spatial transcriptomics data using a graph convolutional neural network

Luijts, T.; Hoogstoel, S.; Pappaert, E.; De Meester, E.; Van Nieuwerburgh, F.; Van Hamme, E.; De Schepper, S.; Willaert, W.; Vral, A.; Hoorens, I.; Van den Eynden, J.

2026-07-09 cancer biology 10.64898/2026.07.08.737173 medRxiv

Top 0.4%

1.3%

Spatial transcriptomics (ST) has revolutionized our understanding of tumor biology but inherently lacks information on the upstream somatic driver mutations. We developed a spatially-aware graph convolutional neural network (MuT-GCNN) that infers TP53 clones directly from ST data. MuT-GCNN was trained on virtual ST slides with clones simulated from a large collection of existing RNA and matched DNA sequencing data. The model is highly performant with precision and recall values exceeding 95% in most analysed cancer types. It is sensitive for single hit mutations and is primarily informed by the expression of p53 signalling genes in cancer cells. After demonstrating the potential of the model on publicly available squamous cell carcinoma (SCC) data, a direct validation was performed using ST and matched DNA sequencing from serial slices obtained from 4 cutaneous SCC samples. With the increasing availability of ST data and upcoming ST atlases, MuT-GCNN can unveil the location of (sub)clonal alterations in TP53, the most frequently mutated gene in human cancer.

6

GeneBench-Pro: Evaluating Multistage Statistical Reasoning in Genomics, Quantitative Biology, and Translational Biomedicine

Li, J. H.; Ho, A. J.

2026-06-30 bioinformatics 10.64898/2026.06.29.735386 medRxiv

Top 0.4%

1.1%

We introduce GeneBench-Pro, an expanded and improved version of GeneBench that comprises harder problems across a wider breadth of domains. GeneBench-Pro is a benchmark for AI agents performing realistic multi-stage scientific analyses in genomics, quantitative biology, and translational biomedicine, which seeks to capture the complexity of real-world problems that computational life scientists face when tasked with producing a conclusion upon which a downstream scientific or translational decision is contingent. The benchmark comprises 129 evaluations targeting quantities of direct practical relevance across 10 primary domains and 21 terminal subdomains, with a genomics-centered core. Similarly to GeneBench, each problem provides the agent with brief context, a target estimand, and minimal guidance otherwise; the agent must then navigate multiple dependent decision points, i.e., substantive inferential forks where a plausible wrong choice changes the downstream analysis, to identify and execute the correct analysis workflow and arrive at the correct answer. Relative to GeneBench, GeneBench-Pro adds 29 new problems, drops three, and introduces significantly redesigned versions of 54 of the remaining 100 overlapping problems. 82 of the 129 problems were reviewed by external domain experts, whose findings led to prompt/data modifications and redesign of those problems whose targets were not sufficiently identifiable. Ten externally reviewed problems are released publicly, 50 held-out problems were provided to Artificial Analysis for independent third-party model benchmarking, and the remainder are retained as an internal holdout. In evaluations over the full 129-problem suite, GPT-5.6 Sol reaches an eval-level pass rate of 28.7% at the max reasoning level, and GPT-5.6 Sol Pro reaches 31.5% in separately reported GPT Pro runs. GPT-5.5 reaches 12.0%, GPT-5.4 reaches 8.9%, and the strongest non-GPT baseline, Claude Opus 4.8, reaches 16.0%.
As with GeneBench, models often complete substantial portions of the workflow but exhibit a consistent gap between noticing and acting: they identify local diagnostic signals but fail to propagate the implications to the corresponding analysis decision. As a result, models often select wrong estimators or persist on initially plausible but incorrect analysis paths. GeneBench-Pro therefore measures an emerging capability of long-horizon biological reasoning that remains unreliable.

7

MARiO: predicting cancer variant pathogenicity by integrating in silico evaluation and patient-level mutational contexts

Nakagawa, H.; Kamatani, T.; Ishibashi, N.; Aoyama, S.; Morioka, M.; Miya, F.; Ikeda, S.

2026-07-10 bioinformatics 10.64898/2026.07.07.736919 medRxiv

Top 0.5%

1.0%

Comprehensive genomic profiling (CGP) supports precision medicine in cancer care, but accurate assessment of missense variant pathogenicity, especially for variants without established consensus, remains challenging. Various computational tools have been developed for variant functional prediction, but most current tools rely solely on variant-level features and do not capture the clinical context of individual patients. To address this limitation, we developed MARiO (Missense Alteration Risk for Oncogenicity), a machine-learning model that integrates variant-level features and patient-level clinical and genomic contexts to effectively predict the pathogenicity of missense variants in cancer. We collected a total of 10,642 missense variants from 1271 patients, and evaluated candidate features for their association with variant pathogenicity, identifying informative features including in silico functional predictions, population allele frequency, variant allele frequency, and tumor mutational burden. Using these selected features, MARiO was developed with extreme gradient boosting. The model integrates multiple in silico prediction tools and patient-specific genomic contexts while accommodating missing values frequently observed in real-world CGP datasets. MARiO outperformed existing tools, achieving an area under the receiver operating characteristic curve of 0.942. The model demonstrated strong generalizability across multiple external datasets and showed consistency with real-world molecular treatment proposals. MARiO offers a robust and clinically relevant approach for missense variant pathogenicity assessment by integrating variant- and patient-level features and serves as a valuable tool to support clinical decision-making.

8

synpact: accurate, memory-light PacBio HiFi read mapping via a hierarchy of locally-consistent syncmer blocks

Aydin, M. S.; Sahlin, K.

2026-07-02 bioinformatics 10.64898/2026.06.28.735066 medRxiv

Top 0.6%

0.8%

Motivation: Mapping PacBio HiFi reads is a routine task and serves as a central step in many bioinformatics analyses. However, the most accurate long-read mappers have a high memory consumption and are slow. Some light-weight mappers have been proposed for faster runtime, but their accuracy is not comparable to state-of-the-art mappers. With the increasing number of available reference sequences, memory-efficient and fast methods for read mapping without the large accuracy drop are desired. A general trade-off with seed-chain-extend mappers is selecting a single, fixed seed size, which forces a compromise between sensitivity and specificity. Results: We present synpact, a long-read mapper that uses several seed sizes (a hierarchy) constructed with Locally Consistent Parsing (LCP) over syncmers. A read is mapped by querying for matches at different levels, followed by sliding window voting. By storing only the coarse upper levels rather than the full hierarchy, the index holds several times fewer entries, while still handling errors by falling back from coarser to finer stored levels at query time. We benchmark synpact against popular long-read mappers on four genomes and different read lengths. For simulated PacBio HiFi data, synpact matches or approaches minimap2 accuracy with higher precision in most cases, while using roughly 5-13 times less peak memory (e.g., about 0.8GB vs. 10.7GB on human) and mapping faster on large or repetitive genomes (e.g., about 10 to 13 times faster than minimap2 on rye). On real HiFi reads synpact has high concordance with minimap2 across the four genomes, as opposed to the other lightweight long-read mappers. Availability and Implementation: synpact is written in Rust and is available at https://github.com/mahmudsami/synpact
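
For readers unfamiliar with syncmers, the sketch below shows one common variant, closed syncmers, where a k-mer is selected if its lexicographically smallest s-mer sits at the first or last position. The abstract does not state which syncmer variant, ordering, or parameters synpact uses, and it does not describe the LCP hierarchy in enough detail to reproduce, so everything here (function names, lexicographic ordering, `k` and `s` values) is an assumption for illustration only.

```python
def is_closed_syncmer(kmer, s):
    """Return True if the smallest s-mer of `kmer` is its first or last s-mer.

    Assumption: s-mers are compared lexicographically; real tools
    typically compare hash values instead.
    """
    smers = [kmer[i:i + s] for i in range(len(kmer) - s + 1)]
    smallest = min(smers)
    return smers[0] == smallest or smers[-1] == smallest


def syncmer_positions(sequence, k, s):
    """Start positions of all k-mers in `sequence` that are closed syncmers."""
    return [
        i
        for i in range(len(sequence) - k + 1)
        if is_closed_syncmer(sequence[i:i + k], s)
    ]


positions = syncmer_positions("CGACTAC", k=5, s=2)
```

Because the decision depends only on the k-mer itself, the same k-mer is selected wherever it occurs, in a read or in the reference, which is what makes syncmers useful as seeds.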

9

Safeguarding open-weight genomic foundation models through weight locking

Karatzikos, A.; Vasilopoulou, A.; Chan, C.; Mouratidis, I.; Georgakopoulos-Soares, I.

2026-07-10 bioinformatics 10.64898/2026.07.07.736795 medRxiv

Top 0.7%

0.8%

Background: Genomic foundation models can dramatically accelerate biological research by learning general-purpose representations of genomic data that transfer across tasks, enabling researchers to predict variant effects, regulatory elements, and molecular function, among others. To safeguard against potential biosecurity threats and malicious misuse of open-weight models, a common strategy involves excluding human-infecting viral genomes from the models' training corpora. This strategy, however, can be easily circumvented by fine-tuning models on abundantly available viral data. Weight-locking with spectral deformation has been proposed as a potential method to prevent fine-tuning of neural networks, but has not been systematically evaluated in biological AI models. Methods: We applied spectral deformation locking to the Evo-1-8k-base genomic foundation model and evaluated a panel of attack configurations spanning naive fine-tuning, low-rank adaptation (LoRA), a simple inserted-layer bypass baseline, and a white-box singular value decomposition (SVD)-chain factorisation at chain lengths k ∈ {2, 3, 5}. Recovered virological capability was quantified on three Human Virome Understanding Evaluation (HVUE) tasks. Results: The lock defended against the naive attacker by either standard pipeline. Naive full fine-tuning under the strong lock drove downstream virological capability significantly below the pretrained baseline on pathogenicity and host tropism, converting the attack into a capability loss rather than a gain, while naive low-rank adaptation neither moved held-out perplexity (PPL) nor recovered downstream capability above pretrained. Thus, we conclude that by neither route does the naive attacker reach the gain achieved by fine-tuning an unlocked model.
Consistent with previous results in non-biological models, an informed attacker who implements the SVD-chain construction does recover capability on pathogenicity prediction, at the cost of increased computational requirements for the fine-tuning process. Availability: https://github.com/Georgakopoulos-Soares-lab/glm-locking.

10

Towards a Unified Exact Solution of Rearrangement Small Parsimony for Natural Genomes

Bohnenkaemper, L.; Frolova, D.

2026-06-28 bioinformatics 10.64898/2026.06.23.733974 medRxiv

Top 0.7%

0.6%

Phylogenetic reconstruction is a fundamental problem in comparative genomics. As a theoretical problem in rearrangement studies, this has been modelled as the Small Parsimony Problem (SPP), in which ancestral genome structures have to be determined minimizing the number of rearrangement events occurring throughout the phylogeny. This problem is of significant interest in microbial and cancer genomics, due to the prevalence and clinical importance of rearrangement events. Genome structures in this problem are expressed as sequences of markers, which are themselves oriented sequence features (such as genes) that abstract from non-structural variations. Recent research has focused on the problem under the natural genomes model, in which arbitrary variations in copy number of markers are allowed. Natural genomes are often studied under the DCJ-indel model, a model which has already been successfully applied to plasmid data. There also exist ILP solutions to a variant of the Small Parsimony Problem under the DCJ-indel model. However, these solutions are limited in their applicability, as they make some critical simplifications for tractability purposes: ancestral marker frequencies and precomputed putative ancestral adjacencies, with their predicted likelihoods, are assumed as input. This creates multiple problems from both a theoretical and practical perspective. Firstly, this simplification means that the search covers not the full state space but only the subset of genomes with the precomputed putative adjacencies, meaning an optimal solution to the exact SPP is not guaranteed. Secondly, marker frequencies are given externally, without any theoretical guarantees. Thirdly, the method used to precompute adjacencies relies on gene trees, which requires the use of genes as markers, when gene annotation is often unreliable, especially in regions with a lot of rearrangement.
Additionally, this restricts the applicability of the approach to sets of genomes that are both divergent and large enough to be able to produce informative gene trees. This is, for example, rarely the case for plasmids, where nucleotide mutations are rarer than rearrangements and genomes are small. Hence, we revisit the problem to solve the exact SPP by introducing a cost to indel operations, which allows us to compute ranges of marker frequencies and derive theoretical results that let us reduce the solution space that the ILP searches without sacrificing optimality. We show that this makes the problem tractable for the case of small and recently related genomes, first on simulated genomes, and then on a set of pathogenic plasmids which represent a realistic use case for the method.

11

CSGDA: A Cell State-Guided Graph Domain Adaptation Network for Single-Cell Drug Response Prediction

Yan, F.; Cao, X.; Mao, F.; You, Z.; Chen, Y.; Du, Z.; Huang, Y.-A.

2026-07-08 bioinformatics 10.64898/2026.07.02.735966 medRxiv

Top 0.7%

0.6%

Intratumoral heterogeneity drives cancer recurrence and metastasis, yet single-cell drug response prediction faces severe "cross-domain" challenges, such as applying in vitro models to in vivo tissues or inferring metastatic resistance from primary tumors. These scenarios trigger distribution shifts arising from heterogeneous sequencing platforms, distinct tissue microenvironments, and metastatic evolution - problems rarely addressed by existing methods. We introduce CSGDA, a cell state-guided graph domain adaptation framework designed to predict drug responses across these biological heterogeneities. CSGDA incorporates biological priors to map gene expression into functional cell states, guiding a structure learning module to construct robust cell topology. To conquer distribution shifts, the model employs graph domain adaptation combined with a novel overlap penalty mechanism. Extensive benchmarks on five scRNA-seq datasets demonstrate that CSGDA outperforms state-of-the-art methods, achieving an average gain of ~6% in ACC and AUPR. Beyond prediction accuracy, we employed integrated gradients to effectively pinpoint key genes involved in drug resistance within a challenging cross-metastasis cisplatin dataset. These findings underscore CSGDA's superior performance in single-cell drug response prediction and its potential in resolving single-cell heterogeneity, paving the way for precision medicine.

12

HiFi-ST: High-Fidelity Reconstruction of Continuous Spatial Transcriptomic Expression Fields via Conditional Neural Fields

Li, H.; Tang, L.; Han, W.; Yang, X.; Chen, X.

2026-07-02 genomics 10.64898/2026.06.29.735170 medRxiv

Top 0.7%

0.6%

Spatial transcriptomics characterizes tissue-scale gene expression patterns, yet its observations are sparse discrete samples of an underlying continuous molecular field, leading to spatial aliasing and sub-resolution information loss. Existing methods usually formulate this task as spot-level point regression, making it difficult to capture both expression continuity and the regional nature of observation. Here, we propose HiFi-ST, a conditional neural field framework for continuous spatial transcriptomics modeling. HiFi-ST formulates spatial gene expression prediction as continuous expression field learning, models each spot as a regional observation over a finite support domain, approximates local integration through Monte Carlo sampling, and integrates multiscale tissue feature extraction with FiLM-based conditional modulation to improve modeling of complex spatial heterogeneity and consistency with the underlying measurement process. Systematic evaluation on three independent datasets (HER2+, cSCC, and Alex_NatGen) showed that HiFi-ST outperformed mclSTExp, BLEEP, THItoGene, His2ST, and HisToGene on key metrics. On HER2+, HiFi-ST achieved an average PCC improvement of 65.1% and an average MSE reduction of 40.9%; on cSCC, PCC improved by 10.2% and MSE decreased by 51.2%; on Alex_NatGen, PCC improved by 80.0% and MSE decreased by 16.3%. In addition, the learned multiscale tissue representations supported downstream spatial immunoanalysis, including assisted identification of candidate TLS regions. Overall, HiFi-ST provides a unified framework bridging discrete measurements and continuous expression field reconstruction for tumor microenvironment analysis and spatial immune structure characterization.
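
The notion of a spot as a regional observation, approximated by Monte Carlo sampling of a continuous field, can be sketched as follows. This is a toy illustration and not HiFi-ST itself: the real model uses a learned conditional neural field with FiLM modulation, whereas `toy_field` here is a hand-written function, and the circular support, sample count, and all names are assumptions.

```python
import math
import random


def spot_average(field, center, radius, n_samples=2000, seed=0):
    """Monte Carlo estimate of the mean of `field` over a disc-shaped spot.

    Assumption: the spot's support is a disc and every point inside it
    contributes equally to the measured value.
    """
    rng = random.Random(seed)
    cx, cy = center
    total = 0.0
    for _ in range(n_samples):
        # sqrt() makes the samples uniform over the disc area, not the radius.
        r = radius * math.sqrt(rng.random())
        theta = 2.0 * math.pi * rng.random()
        total += field(cx + r * math.cos(theta), cy + r * math.sin(theta))
    return total / n_samples


def toy_field(x, y):
    """Stand-in for a continuous expression field (hypothetical)."""
    return 2.0 * x + 1.0


estimate = spot_average(toy_field, center=(3.0, 4.0), radius=0.5)
```

For a linear field the true disc average equals the value at the centre (7.0 here), so the estimate should land close to that; training against such averaged predictions, rather than a single point evaluation, is what ties the continuous field to the spot-level measurement.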

13

Beyond infinite sites: Generalized ABBA-BABA statistic for deeper phylogenies

Zhang, C.; Nielsen, R.

2026-07-08 bioinformatics 10.64898/2026.07.06.736715 medRxiv

Top 0.7%

0.6%

Patterson's D statistic detects gene flow from ABBA-BABA site patterns, but its biallelic site patterns fail under deeper divergences where multiple hits cause false positives. We propose two extensions, D+ and D*. Both incorporate multiallelic site patterns to reduce saturation bias under the JC and F84 models. Simulations show that D+ and D* both remain correctly null under all conditions and detect gene flow effectively, with distinct advantages: D+ guarantees non-negativity of the denominator, while D* provides greater robustness when mutation rates vary across genomic regions. The source code and binary files are publicly available at https://github.com/chaoszhang/ASTER.
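
As background, the classic biallelic statistic that the abstract extends is D = (nABBA - nBABA) / (nABBA + nBABA) for the topology (((P1, P2), P3), O). The sketch below computes only this classic form on toy aligned sequences; the D+ and D* extensions are not defined in the abstract and are not attempted here. The function name and the example sequences are hypothetical.

```python
def patterson_d(p1, p2, p3, outgroup):
    """Classic Patterson's D from four aligned sequences.

    Only biallelic sites are counted; sites with three or more alleles
    are skipped, which is the information the classic statistic ignores.
    Returns (D, n_abba, n_baba); D is None if no informative site exists.
    """
    abba = baba = 0
    for a, b, c, o in zip(p1, p2, p3, outgroup):
        if len({a, b, c, o}) != 2:
            continue  # monomorphic or multiallelic site
        if a == o and b == c:
            abba += 1  # P1 ancestral, P2 and P3 share the derived allele
        elif b == o and a == c:
            baba += 1  # P2 ancestral, P1 and P3 share the derived allele
    if abba + baba == 0:
        return None, abba, baba
    return (abba - baba) / (abba + baba), abba, baba


# Toy alignment: sites 1 and 3 are ABBA, site 2 is BABA,
# site 4 is monomorphic, site 5 has three alleles.
d_value, n_abba, n_baba = patterson_d("AGAAA", "CTGAC", "CGGAG", "ATAAA")
```

In this toy alignment the fifth site is dropped because it carries three alleles; under deep divergence such sites become common, which motivates statistics that use multiallelic patterns.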

14

Binary search and set operations on compacted k-mer lists

Dufresne, Y.; Andreace, F.

2026-07-03 bioinformatics 10.64898/2026.06.29.735436 medRxiv

Top 0.8%

0.6%

Sorted lists of elements are particularly good for computing set operations. A single scan of the two lists is sufficient to materialize or count the results of the union, intersection, difference, and xor operators. In bioinformatics, only a few tools are designed to perform these operations on k-mers. A fast tool like KMC allows set operations at the cost of storing individual k-mers. In this paper, we introduce a novel way to represent sorted k-mers as a collection of recomposed super-k-mer sorted lists. We introduce the concept of virtual super-k-mer and show how to construct, query and perform set operations on sorted lists of virtual super-k-mers. In the implementation sklib, we demonstrate high throughput of the data structure for construction and set operations, while remaining competitive in query capabilities, within a controlled memory footprint (2-5x decrease in bits/element compared to KMC).
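
The single-scan property mentioned at the start of the abstract can be sketched on plain sorted k-mer lists. This does not model super-k-mers or the virtual super-k-mer structure of sklib; it only shows the baseline merge that makes sorted lists attractive. The function name and the toy k-mers are hypothetical, and the inputs are assumed to be sorted and free of duplicates.

```python
def merge_set_ops(left, right):
    """Compute union, intersection, difference (left - right) and xor
    of two sorted, duplicate-free lists in one linear scan."""
    union, inter, diff, xor = [], [], [], []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:      # present in both lists
            union.append(left[i])
            inter.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:     # present only in left
            union.append(left[i])
            diff.append(left[i])
            xor.append(left[i])
            i += 1
        else:                        # present only in right
            union.append(right[j])
            xor.append(right[j])
            j += 1
    for item in left[i:]:            # leftovers exist only in left
        union.append(item)
        diff.append(item)
        xor.append(item)
    for item in right[j:]:           # leftovers exist only in right
        union.append(item)
        xor.append(item)
    return {"union": union, "intersection": inter, "difference": diff, "xor": xor}


result = merge_set_ops(
    ["ACG", "CGT", "GTA", "TAC"],
    ["AAA", "CGT", "TAC", "TTT"],
)
```

The scan touches each element once, so the cost is linear in the combined list length regardless of which operator is requested.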

15

Multi-modality Graph Representation Learning for Malignant Cell Identification from scRNA-seq using DeepMalignant

Bhattarrai, P.; Yuan, W.; Chi, H.; Zhou, X. M.; Mallory, X.

2026-07-03 bioinformatics 10.64898/2026.06.29.734828 medRxiv

Top 0.8%

0.6%

Distinguishing malignant from normal cells in single-cell RNA sequencing data remains a critical yet challenging task in cancer genomics. Existing methods often suffer from poor precision, limited generalizability across cancer types, and reduced robustness across different sequencing platforms. We developed DeepMalignant, an unsupervised multimodal graph attention autoencoder for malignant cell identification that jointly integrates gene expression and copy number alteration (CNA) information. We applied DeepMalignant to five datasets covering 26 samples and four cancer types (breast, colorectal, pancreatic, and ovarian cancers), generated by three platforms (10x Genomics, inDrop, and Drop-seq) for benchmarking and compared it with existing state-of-the-art methods including scMalignantFinder, PreCanCell, CopyKAT, ikarus, and Cancer-Finder. DeepMalignant achieved the best overall balance of precision and recall and consistently outperformed the existing methods that used either gene expression or CNA in F1 scores. Ablation studies showed that both CNA-based edge weighting and graph attention aggregation contribute independently to performance, and attribution analysis further indicated that the learned embeddings capture biologically meaningful malignant programs. We further applied DeepMalignant to two ductal carcinoma in situ (DCIS) samples, DCIS2 and DCIS1, that have matched spatial transcriptomics and scRNA-seq data. DeepMalignant identified tumor-enriched regions that were highly consistent with the matched histological image. 
The downstream cell-cell communications analysis revealed that fibroblast-derived C3 and MIF both directed signaling more toward normal epithelial cells than tumor epithelial cells, demonstrating that accurate tumor-normal cell classification by DeepMalignant enables biologically meaningful interrogation of the tumor microenvironment and revealing how stromal cells differentially communicate with malignant versus normal epithelial populations.

16

Neural Processes with Normalizing Flows for Wheat Height Estimation

Boss, M.; Volpi, M.; Roth, L.

2026-07-09 plant biology 10.64898/2026.06.24.734247 medRxiv

Top 0.8%

0.5%

In this work, we investigate modeling plant traits over time using neural processes, a class of machine learning models that learn distributions over functions. Plant growth is an inherently stochastic process with complex dynamics measured mostly at irregular times throughout the growing seasons. While individual trait trajectories may be simple, their distributions are shaped by complex interactions between genotype, environment, and other factors. In particular, we focus on plant height in wheat, a deceptively simple-looking trait with complex dynamics. To model these trajectory distributions, we evaluate neural processes and in particular extensions using normalizing flows, with different combinations of genotype and environmental covariates. For controlled evaluations, we generate synthetic wheat height trajectories calibrated against Swiss weather station records and the FIP1 dataset. To fully evaluate these trajectory distributions, we use signatures, vector representations of sequential data, together with Sig-MMD and the recently introduced CSig-MMD. Sig-MMD enables direct pathwise comparison of predicted and simulator trajectory distributions, while CSig-MMD focuses this comparison on the tail, including lodged trajectories. Together, these metrics allow us to assess whether the models capture the full distribution of growth trajectories, including rare outcomes.

17

GCBM-DCT-HV-Bio-NL-Grow-CHG-CSM-RHEC: A Unified Geometric, Biological, Causal, and Regenerative Framework for Mechanism-Aware Tissue and Connectome Modeling

Xu, T.; Hu, Z.; Sun, X.; Jin, L.; Xiong, M.

2026-06-29 bioinformatics 10.64898/2026.06.24.734320 medRxiv

Top 0.8%

0.5%

Modern biological prediction problems increasingly require models that go beyond Euclidean feature regression and local graph smoothing. Tissue, cellular, and connectome systems are nonlinear, geometry-dependent, intervention-sensitive, history-dependent, and subject to regenerative or homeostatic constraints. We propose GCBM/DCT/HV/Bio/NL/Grow/CHG/CSM/RHEC, a unified model for mechanism-aware biological prediction. The model integrates geometric connectome dynamics, differentiable charted tissue geometry, Hamiltonian latent transport, nonlinear biological kinetics, nested latent memory, continual growth without overwriting, causal hypergraph structure, causal structure modeling, and regenerative homeostatic error correction. Unlike Euclidean baselines, which treat observations as flat vectors, and local graph baselines, which use neighborhood smoothing without mechanistic structure, the proposed model represents biological states (Trapnell 2015) as coupled geometric, dynamical, causal, and regenerative objects. We evaluate the model on four synthetic toy studies, Toy A, B, C, and D, designed to reflect increasing biological complexity: local Euclidean structure, nonlinear mechano-chemical interaction, causal intervention response, and out-of-distribution regenerative shift. Compared with Euclidean and local graph baselines, the full model achieves the lowest mean squared error across all four toy studies. Relative to the Euclidean baseline, the full model reduces MSE by approximately 63.0%, 89.1%, 89.0%, and 90.9% on Toy A, Toy B, Toy C, and Toy D, respectively. These results support the value of integrating geometry, mechanism, causal structure, adaptive growth, and regenerative correction into a single predictive architecture (Figure 1).

18

Tissue tearing degrades optimal-transport and diffeomorphic registration of spatial transcriptomics beyond displacement magnitude: a multi-seed deformation benchmark and a supervised graph cross-attention proof-of-concept.

Maniar, R. K.; Lee, S. G.; Lee, S. S.

2026-07-05 bioinformatics 10.64898/2026.06.30.735390 medRxiv

Top 0.8%

0.5%

Background. Three-dimensional reconstruction from serial spatial-transcriptomics (ST) sections requires registering adjacent slices, but physical sectioning introduces tears -- discontinuous, non-isometric deformations. Leading methods rely on priors that tears strain: PASTE/PASTE2 use Fused Gromov-Wasserstein optimal transport (OT), which assumes near-isometric preservation of within-slice distances, while STalign and CODA use diffeomorphic (LDDMM) mapping, which cannot change tissue topology. Learned-deformation ST methods are emerging (STaCker, INST-Align), but OT/diffeomorphic behaviour under tearing has not been systematically characterised. Methods. On the spatialLIBD human DLPFC Visium dataset (Maynard et al., 2021; 3 donors), we build a controlled benchmark -- known smooth warps, single-block rigid tears (expression unchanged), and an identity self-control -- at severities of 0-8 spot pitches, scored against an approximate array-position ground truth (~8 px residual). We evaluate three unsupervised incumbents -- PASTE2 (OT, over five warp seeds), STalign (diffeomorphic LDDMM), and GPSA (Gaussian-process warp) -- add a magnitude-matched smooth control, and test a minimal graph model, Sutura (per-slice graph encoder -> cross-attention correspondence -> per-spot displacement; spatial coupling is local kNN message passing only, no explicit smoothness penalty). Sutura is trained supervised on each tissue's ground truth; all baselines are unsupervised. Generalisation is assessed by leave-one-donor-out across all three donors. Results. OT registration is robust to smooth warps but degrades reproducibly under tearing: nearest-correspondence (argmax) error 722 +/- 5 -> 855 +/- 27 px and layer accuracy 64.9% -> 60.5% (mean +/- 95% CI, 5 seeds). 
The effect is not merely displacement magnitude: at a matched mean displacement (~2000 px), a smooth warp costs 769 px / 60.2% accuracy whereas a tear costs 863 px / 57.5% -- an extra ~100 px and ~3 points attributable to the discontinuity. STalign (LDDMM) and GPSA (GP warp) both collapse at severe tears (866 px and 931 px respectively), confirming tear-collapse is field-wide across three independent method families. Trained and evaluated on the same donor, Sutura fits torn-tissue correspondence to a median 99 -> 106 px (5-seed), but under leave-one-donor-out is 1236 +/- 2 -> 1584 +/- 52 px -- approximately 1.8-3.6x worse than PASTE2 on every unseen donor. A contrastive correspondence loss halves the gap on two of three donors (to 816 -> 949 and 749 -> 826 px, approximately 1.1-1.2x PASTE2 at worst-case tear) but is modest on the third and never surpasses PASTE2. Conclusion. Tearing is a real, magnitude-controlled failure mode of all three incumbent method classes. A learned model fits it in-sample but donor-invariant generalisation remains open. The contrastive fix roughly halves the held-out gap on two of three donors and nears PASTE2 at worst-case tear, but does not surpass it: donor-invariance is improved, not solved. The durable contribution is the benchmark, the characterisation across three method families, and an honest negative with a diagnosed mechanism.
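The distinction between a tear and a smooth warp can be illustrated with a small sketch. It assumes a regular grid of spots and a tear modelled as a rigid translation of every spot on one side of a vertical cut; the grid size, the pitch in pixels, and all function names are hypothetical and are not taken from the benchmark itself.

```python
import math


def make_grid(n_rows, n_cols, pitch):
    """Spot coordinates on a regular grid, listed row by row."""
    return [(c * pitch, r * pitch) for r in range(n_rows) for c in range(n_cols)]


def apply_block_tear(coords, cut_x, shift):
    """Rigidly translate every spot with x >= cut_x; leave the rest fixed."""
    dx, dy = shift
    return [(x + dx, y + dy) if x >= cut_x else (x, y) for x, y in coords]


def mean_displacement(before, after):
    """Average Euclidean distance each spot moved."""
    return sum(math.dist(a, b) for a, b in zip(before, after)) / len(before)


pitch = 100.0  # hypothetical spot pitch in pixels
grid = make_grid(n_rows=4, n_cols=6, pitch=pitch)

# Tear of severity 4 spot pitches: columns 3..5 slide right as one rigid block.
torn = apply_block_tear(grid, cut_x=3 * pitch, shift=(4 * pitch, 0.0))

mean_shift = mean_displacement(grid, torn)   # half the spots move 400 px
gap_before = math.dist(grid[2], grid[3])     # neighbours across the cut
gap_after = math.dist(torn[2], torn[3])      # same pair after the tear
gap_within = math.dist(torn[3], torn[4])     # pair inside the moved block
```

Distances within each block are preserved while distances across the cut jump from one pitch to five, which is the kind of non-isometric discontinuity that a near-isometry prior or a topology-preserving map is not designed to absorb.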

19

ProtBLIP2-SST: Protein Function Prediction via BLIP2 with Sequence, Structure, and Text

Chen, Z.; Luo, Q.

2026-07-12 bioinformatics 10.64898/2026.07.10.737551 medRxiv

Top 0.9%

0.5%

Protein function prediction traditionally relies on structured gene ontology (GO) labels or multi-label classifiers. However, these labels or classifiers cannot flexibly describe molecular function, biological process, cellular component, and free-text functional narratives in a single output. In comparison, generation-based approaches offer an intuitive paradigm for flexible free-text protein annotation, with large language models (LLMs) as a representative method for protein-text modeling. Recent efforts on utilizing LLMs for protein semantic understanding and annotation generation have adopted sequence-only encoding or sequence-text contrastive alignment paradigms, yet without explicit consideration of three-dimensional structural information. To address these limitations in current protein function prediction methods, we present ProtBLIP2-SST, a two-stage framework built on the BLIP2 model architecture that bridges protein sequence, structure, and text for open-ended protein functional caption generation. Specifically, we first integrate sequence and structure information through SaProt, a protein language model (PLM) with a structure-aware vocabulary that fuses residue tokens with Foldseek-derived 3Di structural tokens. To empower the LLM to understand protein semantics, we employ a Q-Former (a querying transformer in BLIP2) with learnable query tokens as the cross-modal projector to align protein features from the frozen SaProt encoder and text features from a frozen BiomedBERT via protein-text contrasting, protein-text matching, and protein captioning objectives. After alignment, the protein features are linearly projected and prepended to the prompt embeddings of the LLM for protein captioning fine-tuning with LoRA. 
Trained on 441k protein-text pairs from Swiss-Prot with corresponding structures from the AlphaFold Database, ProtBLIP2-SST outperforms sequence-only and sequence-text alignment baselines on protein captioning metrics, with ablation studies demonstrating the effectiveness of integrating structure with sequence information for improved protein understanding. Through a unified two-stage alignment-and-generation pipeline, ProtBLIP2-SST integrates protein sequence and structural information and overcomes the rigidity of traditional GO-centric classification, generating open-ended captions that jointly describe molecular function, subcellular location, and homology context in a single output.
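The "linearly projected and prepended" step of the second stage can be sketched in a few lines. This is a dependency-free illustration under stated assumptions, not the authors' implementation: the dimensions, weights, and function names are hypothetical, and a real system would use tensors and a trained projection layer.

```python
def linear_project(vectors, weight, bias):
    """Apply y = W x + b to each vector; `weight` is a list of output rows."""
    return [
        [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]
        for vec in vectors
    ]


def prepend_prefix(protein_tokens, prompt_embeddings):
    """Place the projected protein tokens in front of the prompt embeddings."""
    return protein_tokens + prompt_embeddings


# Hypothetical Q-Former outputs: 2 query tokens of dimension 3.
query_outputs = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]

# Hypothetical projection from dimension 3 to an LLM embedding dimension of 2.
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
bias = [0.0, 0.5]

# Hypothetical prompt of 3 tokens already embedded in the LLM space.
prompt_embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

protein_prefix = linear_project(query_outputs, weight, bias)
llm_inputs = prepend_prefix(protein_prefix, prompt_embeddings)
```

The language model then sees a sequence of five embeddings, the first two carrying protein information, so the number of query tokens fixes how much of the context window the protein occupies regardless of sequence length.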

20

A Bayesian Network-Based Framework for Causal Cancer Drug Target Discovery Integrating Patient and Cell Line Data

Yoon, S. H.; Park, Y. R.; Kim, H. U.

2026-07-13 bioinformatics 10.64898/2026.07.13.736676 medRxiv

Top 0.9%

0.5%

Current approaches to cancer drug target discovery face two key limitations: poor translation of cell line-derived targets to patient tumors, and the lack of causal explanation of the regulatory mechanisms underlying target prioritization. Here we present BayesTx (Bayesian Therapeutics target discovery), a Bayesian network framework that integrates patient transcriptomics data with cell line data to identify causal therapeutic targets in cancer. BayesTx projects both data domains into a shared biological space of pathway and transcription factor activities, learns domain-specific causal graphs, and merges them through weighted edge aggregation with bootstrap consensus filtering. Do-simulation on the consensus network quantifies the causal effect of each transcription factor on cancer cell viability. Applied to breast cancer using TCGA-BRCA (The Cancer Genome Atlas breast cancer cohort) and DepMap (Cancer Dependency Map) datasets, the framework ranked 47 transcription factors by predicted causal impact, with gene-level targets further derived through regulon-based propagation. Top-ranked transcription factor (TF) targets were independently supported by survival analysis in external cohort data and pharmacogenomic drug response associations. Overall, BayesTx demonstrates that cross-domain Bayesian network modeling can bridge patient and cell line data to systematically identify causal therapeutic targets in cancer.
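The idea behind do-simulation can be shown on a toy network. The sketch below assumes a linear structural model with made-up edge weights and node names (`TF_A`, `TF_B`, `viability`); the abstract does not specify the functional form BayesTx uses, so this shows only the general technique of clamping a node and cutting its incoming edges.

```python
def simulate_do(order, parents, intercepts, do=None):
    """Expected node values in a linear causal model.

    `order` is a topological order of the nodes; `parents[node]` maps each
    parent to its edge weight. Nodes listed in `do` are clamped to the given
    value and their incoming edges are ignored.
    """
    do = do or {}
    values = {}
    for node in order:
        if node in do:
            values[node] = do[node]  # intervention: sever incoming edges
        else:
            values[node] = intercepts[node] + sum(
                w * values[p] for p, w in parents[node].items()
            )
    return values


# Hypothetical toy network: TF_A -> TF_B -> viability and TF_A -> viability.
order = ["TF_A", "TF_B", "viability"]
parents = {
    "TF_A": {},
    "TF_B": {"TF_A": 0.5},
    "viability": {"TF_A": -0.2, "TF_B": 0.8},
}
intercepts = {"TF_A": 1.0, "TF_B": 0.0, "viability": 0.0}

baseline = simulate_do(order, parents, intercepts)
knock_b = simulate_do(order, parents, intercepts, do={"TF_B": 0.0})
knock_a = simulate_do(order, parents, intercepts, do={"TF_A": 0.0})

# Causal effect of each knock-down on viability, relative to baseline.
effect_b = knock_b["viability"] - baseline["viability"]
effect_a = knock_a["viability"] - baseline["viability"]
```

In this toy case, clamping `TF_B` leaves `TF_A` untouched because the intervention removes only the edges into `TF_B`, which is what separates a do-intervention from conditioning on an observed low value of `TF_B`. Ranking nodes by the magnitude of such effects gives the kind of ordering the abstract describes for transcription factors.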